Book review statistics

About the data

There is a Goodreads API and an R interface to accompany it, however we will be using the data set found on Kaggle here because it is proving far too difficult to gather and process data from the Goodreads API.

First we will import the data and take a look at its contents.

books <- read_csv("data/books.csv") %>%
    rename(rating = average_rating,
           total_ratings = ratings_count,
           lang = language_code,
           pages = num_pages) %>%
    na.omit()
books
## # A tibble: 8,469 x 12
##    bookID title authors rating isbn  isbn13 lang  pages total_ratings
##     <dbl> <chr> <chr>    <dbl> <chr> <chr>  <chr> <dbl>         <dbl>
##  1      1 "Har… J.K. R…   4.57 0439… 97804… eng     652       2095690
##  2      2 "Har… J.K. R…   4.49 0439… 97804… eng     870       2153167
##  3      4 "Har… J.K. R…   4.42 0439… 97804… eng     352          6333
##  4      5 "Har… J.K. R…   4.56 0439… 97804… eng     435       2339585
##  5      8 "Har… J.K. R…   4.78 0439… 97804… eng    2690         41428
##  6      9 "Una… W. Fre…   3.74 0976… 97809… en-US   152            19
##  7     10 "Har… J.K. R…   4.73 0439… 97804… eng    3342         28242
##  8     12 "The… Dougla…   4.38 0517… 97805… eng     815          3628
##  9     13 "The… Dougla…   4.38 0345… 97803… eng     815        249558
## 10     14 "The… Dougla…   4.22 1400… 97814… eng     215          4930
## # … with 8,459 more rows, and 3 more variables: text_reviews_count <dbl>,
## #   publication_date <chr>, publisher <chr>

So it looks like we are working with 12 variables, and 8469 observations. Some of these variables are redundant and, for our purposes, can be ignored.

books$bookID <- NULL
books$isbn <- NULL
books$isbn13 <- NULL

Publication language

Since we speak English we’d like to focus solely on English language books. But first let’s see how many of them we have

books %>%
    subset(is.na(as.numeric(lang))) %>%
    subset(!is.na(lang)) %>%
    group_by(lang) %>%
    tally() %>%
    arrange(desc(n)) %>%
    slice(1:10) %>%
    ggplot(aes(x = factor(lang, levels = lang), y = n, fill = lang)) +
    geom_bar(stat = "identity") +
    labs(title = "Title number of books by language",
         x = "language",
         y = "number of books") +
    theme(axis.text.x = element_text(angle=90, hjust=1))

So the overwhelming majority of the data is in eng so we don’t loose much by ignoring the other publications. However, it looks like US and Great Britain version of books fall under a different language tag (and if we search hard Canadian too). For our purposes "en-CA",“en-GB”, anden-US` are the same so let us clean up the data a little.

books$lang <- books %>%
    pull(lang) %>%
    str_replace_all("en*(g|-GB|-US|-CA)", "eng")
eng_books <- books %>% filter(lang == "eng" & rating >= 2.5)

Next let’s get an idea of the distribution of ratings for English language books.

eng_books %>%
    group_by(rating) %>%
    summarize(total_books = n()) %>%
    ungroup() %>%
    ggplot(aes(x = rating)) +
    geom_histogram(aes(y = total_books/1.00001, fill = rating),
                   stat = "identity",
                   freq = FALSE) +
    labs(title = "Total number of books per rating",
         x = "average rating",
         y = "number of books")

And what about publications per year?

eng_books <- eng_books %>%
    mutate(publication_year = as.numeric(str_sub(publication_date, -4)))
eng_books %>% group_by(publication_year) %>%
    summarize(total_books = n()) %>%
    ungroup() %>%
    ggplot(aes(x = publication_year, y = total_books)) +
    geom_point() +
    xlim(1900, 2020) +
    labs(x = "publication year",
         y = "books published")

books_per_year <- eng_books %>% group_by(publication_year) %>%
    summarize(total_books = n()) %>%
    arrange(desc(total_books)) %>%
    ungroup()
model_publish_books <- books_per_year %>%
    filter(publication_year <= 2006) %>%
    lm(log(total_books) ~ publication_year, .)

filtered_model <- books_per_year %>%
    filter(publication_year > 2006) %>%
    lm(log(total_books) ~ publication_year, .)

model_publish_books %>% tidy()
## # A tibble: 2 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      -159.       9.26        -17.1 8.62e-26
## 2 publication_year    0.0818   0.00470      17.4 3.47e-26
filtered_model %>% tidy()
## # A tibble: 2 x 5
##   term             estimate std.error statistic     p.value
##   <chr>               <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)       690.      72.3         9.54 0.000000593
## 2 publication_year   -0.341    0.0359     -9.51 0.000000615
estimate_2020 <-
    predict(model_publish_books,
            newdata = data.frame(publication_year = seq(2015,2020)))

Prolific writers

Who are the top rated authors? Naively we can look at the average rating of each author as follows.

eng_books %>%
    group_by(authors) %>%
    summarize(average_rating = sum(rating)/n()) %>%
    arrange(desc(average_rating)) %>%
    ungroup() %>%
    slice(1:10)
## # A tibble: 10 x 2
##    authors                                              average_rating
##    <chr>                                                         <dbl>
##  1 Chris    Green/Chris Wright/Paul Douglas Gardner                  5
##  2 James E. Campbell                                                 5
##  3 Julie Sylvester/David Sylvester                                   5
##  4 Keith Donohue                                                     5
##  5 Laura Driscoll/Alisa Klayman-Grodsky/Eric     Weiner              5
##  6 Middlesex Borough Heritage Committee                              5
##  7 Ross Garnaut                                                      5
##  8 Sheri Rose Shepherd                                               5
##  9 Todd Davis/Marc Frey                                              5
## 10 William C. Dowling                                                5

who are these people?

So highly rated authors are not necessarily notable ones, perhaps the sum total of the ratings is a better indicator of good author.

prolific_writer_names <- eng_books %>%
    group_by(authors) %>%
    summarize(book_count = n(),
              sum_rating = sum(rating)) %>%
    arrange(desc(sum_rating)) %>%
    slice(1:10) %>%
    pull(authors)

prolific_writers <- eng_books %>% filter(authors %in% prolific_writer_names)

prolific_writers %>%
    (function(...) {
        tmp <- tibble(...)
        tmp$authors <- tmp$authors %>%
            str_replace_all("Margaret Weis/Tracy Hickman",
                            "Weis/Hickman")
        tmp}) %>%
    group_by(authors) %>%
    summarize(book_count = n(),
              sum_rating = sum(rating)) %>%
    arrange(desc(sum_rating)) %>%
    slice(1:10) %>%
    ggplot(aes(factor(authors, levels = authors), sum_rating, fill = authors)) +
    geom_col() +
    coord_flip() +
    labs(title = "Author rankings",
         y = "author",
         x = "sum rating") +
    scale_fill_brewer(palette="Spectral")

OK, this is much better and most of them are recognizable. Now let’s take a look at their average rating.

prolific_writers %>%
    (function(...) {
        tmp <- tibble(...)
        tmp$authors <- tmp$authors %>%
            str_replace_all("Margaret Weis/Tracy Hickman",
                            "Weis/Hickman")
        tmp}) %>%
    group_by(authors) %>%
    summarize(avg_rat = sum(rating)/n()) %>%
    ungroup() %>%
    ggplot(aes(x = authors, y = avg_rat, fill = authors)) +
    geom_bar(stat = "identity") +
    labs(title = "Top 10 average author rating",
          x = "auhor",
          y = "average rating") +
    theme(axis.text.x = element_text(angle = 90))

Modeling Stephen King

So Stephen King is the most prolific writer in this data set and also has fairly good ratings. In this section we will take a look at models for predicting the popularity of a hypothetical future publication of Stephen King.

stephen_king <- books %>% filter(authors == "Stephen King" &
                                 title != "Blood and Smoke" &
                                 title != "LT's Theory of Pets")
sk_model <- lm(rating ~ pages, stephen_king)
sk_model %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 3.62      0.111        32.7  1.03e-26
## 2 pages       0.000795  0.000217      3.67 8.61e- 4
predict(sk_model,
        newdata = data.frame(pages = c(300)))
##        1 
## 3.855764
stephen_king %>% summarize(mean = mean(pages))
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  478.

Publishers

Next we will at the top 10 publishers by rating. We will first need to turn the publisher name into a factor to more easily work with it.

eng_books$publisher <- factor(eng_books$publisher)

Next let us see who the top 10 publishers are.

top_10_publishers <- eng_books %>%
    group_by(publisher) %>%
    summarize(total_books = n()) %>%
    arrange(desc(total_books)) %>%
    slice(1:10) %>%
    pull(publisher)

eng_books_tp <- eng_books %>%
    filter(publisher %in% top_10_publishers)

Next let us see how the popularity of their publications have changed since 2000.

eng_books_tp %>%
    filter(publication_year >= 2000 &
           publication_year <= 2007) %>%
    ggplot(aes(x = fct_reorder(publisher, rating),
               y = rating, color = publisher)) +
    geom_boxplot() +
    labs(title = "Publisher rating in {closest_state}",
         x = "publisher") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    transition_states(publication_year,
                      transition_length = 2,
                      state_length = 4) +
    enter_fade() +
    exit_shrink() +
    ease_aes('sine-in-out')

Two publishers stand out from among the rest, VIZ Media LLC and Harper Collins. VIZ media publishes graphic novels and comic books which have a higher average rating than books.